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2.  FINAL  PROGRESS  REPORT 

2.1.  Statement  of  the  Problem  Studied 

Demands  for  databases  with  high  efficiency  and  high  throughput  have  led  to  the  recent 
development  of  several  types  of  database  systems  utilizing  parallel  processors.  This  project 
involved  research  aiming  to  improve  the  state  of  the  art  of  highly  parallel  database  systems. 
Both  the  logical  properties  (usability)  and  the  physical  properties  (efficiency)  were  enhanced. 
We  developed  very  efficient  algorithms  for  parallel  database  management  systems  in 
semantic/object-oriented  models.  Our  approach  has  several  advantages  over  the  currently 
known  theory  and  results  on  database  machines: 

•  Unlike  the  current  database  machines  based  on  the  contemporary  Relational  Model  of 
databases,  our  work  is  based  on  semantic  data  models  (including,  in  the  broad  sense, 
object-oriented  models  as  well  as  storage  of  multi-media  data).  The  use  of  semantic 
models  assures  better  logical  properties:  friendlier  and  more  intelligent  user  interfaces, 
comprehensive  enforcement  of  integrity  constraints,  improved  database  design,  greater 
flexibility,  and  substantially  shorter  application  programs  (which  reduces  the 
programming  effort  and  facilitates  program  verification). 

•  At  the  physical  level,  the  system  is  more  efficient  than  existing  database  systems.  The 
algorithms  and  prototype  system  developed  are  highly-efficient.  In  particular,  the  use  of 
the  semantic  model  allows  better  exploitation  of  the  parallelism. 

This  project  focused  on  two  problems:  concurrency  control  and  load  balancing. 

2.2.  Summary  of  the  Most  Important  Results 

2.2.1.  Concurrency  control 

We  have  developed  an  improved  semantic  optimistic  concurrency  control  algorithm  and  a 
query  optimization  technique  (lazy  queries)  that  can  be  used  in  a  massively  parallel  B-tree 
with  variable-length  keys.  B-trees  with  variable-length  keys  can  be  effectively  used  in  a 
variety  of  database  types.  In  particular,  we  are  using  this  B-tree  structure,  which  also  offers 
data  compression,  in  our  implementation  of  a  semantic  object-oriented  DBMS.  Our 
concurrency  control  algorithm  uses  semantically  safe  optimistic  virtual  "locks"  that  achieve 
very  fine  granularity  in  conflict  detection.  Our  algorithm  contributes  smaller  transaction 
conflict  probability  by  using  very  fine  granularity  (attribute  or  string  level  granularity).  We 
have  proven  that  the  algorithm  ensures  serializability  and  external  consistency.  Our 
algorithm  uses  local  logical  clocks  and  does  not  require  physical  clock  synchronization.  A 
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lazy  query  execution  algorithm  is  used  to  reduce  the  client-server  traffi  ^  • 
granularity  of  concurrency  control  by  minimizine  the  nnmht  f  ^ 

2.2.2.  Load  balancing 

transactions  that  transfer  small  strin?  inie  i  /  Performed  as  a  series  of  load  balancing 
transactions  requirrat"™^  ^  ™"'er.  Load  balancing® 

balancing  transitions  shouiarwe  ,  cZinaL  ta 
time  is  large,  a  centralized  algorithm  neloirwel  It 

In  our  B-tme.  a  centralized  load  blncl?  a  ,  ‘  'y*'™  bottleneck, 

distribution  statistics  from  all  partitions  and  heuristiLtl  collects  data  and  load 

initiating  the  load  balancing  1^““^  af 

load  balancing  transactions  the  load  balancing  mnH  i  partitions.  Apart  from  the 

empty  partition  which  is  no  longeTrefetd  o  ”  delete  an 

control  algorithm  maintains  its  safety  and  efficiency  ItStllndfnTload  bS” 
performed  according  to  our  algorithm.  ^  iinstanding  load  balancing  activity 

2.2.3.  Benchmarks 

benchmarks  is  to^show  th^t™  htle  our  ^DB^oS^^bT'^P^n  P‘^''Pose  of  the 

properties,  its  perfonnance  is  n logical 
other  DBMS’s.  ’  better  than  that  of  the  best 

Our  first  semantic  benchmark,  SB  1,  run  on  a  PenHum  onn 
RAM,  showed  that  on  certain  tvne.  nf  nn.V;  eu  c  computer  with  128MB  of 

highly-optimized  fully-indexed  Oracle^datahL  ^  Database  is  30  times  faster  than  a 

databaselso  requireslbott  to  111“  rceUsSrut 
the  schema  of  the  Oracle  database,  we  managed  Jreduce  ^  of 

3  times  more  than  that  of  SDB  but  then  th^  c  a  ^  ^  ^  space  requirements  to  about 

SDB.  Allhough  the  resutlfllhel'y'ltrvlS  'han 

(reduce)  the  space  requirements  of  SDR  h^  ■  i  we  expect  to  further  improve 

algorithms  in  the  next  SDB  release.  ^  ementing  our  new  data  compression 

simii°™leTS'ln“^^^^  circnmslances,  has  shown 

emliled  SB2  BenckZk  (Consl^y  Z>2st 

on  a  partite  '’"'■“"‘"S  “>  it  work  fast 

means  that  the  same  installation  of  SDB  will  work  aqllly^wronluatlref  thl  arelse!| 
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on  a  server.  Oracle,  being  tuned  for  one  application,  will  not  perform  as  well  for  another 
application,  nor  will  it  perform  as  well  for  the  same  application  when  users  pose  new  types  of 
ad-hoc  queries. 

2.2.4.  Semantic  SQL 

While  not  directly  supported  under  this  project,  a  related  project  supported  by  funding 
leveraged  from  this  project  has  resulted  in  the  adaptation  of  SQL  (Structured  Query 
Language),  which  is  the  standard  language  for  relational  databases,  to  semantic  databases. 
The  original  purpose  of  this  adaptation  was  to  be  compatible  with  and  be  able  to 
communicate  with  relational  tools.  However,  it  turned  out  that  the  size  of  a  typical  SQL 
program  for  a  semantic  database  is  many  times  smaller  than  for  an  equivalent  relational 
database.  SQL  offers  significant  benefits  when  used  with  semantic  databases  and  although 
originally  intended  for  relational  databases,  offers  many  advantages  to  the  semantic  model. 
We  have  implemented  a  stand-alone  SQL  server  as  well  as  an  embedded-SQL  preprocessor. 

A  multi-user  semantic  database  engine  has  been  developed  and  is  now  in  the  testing 
phase.  A  user  interface  to  this  engine  has  been  developed  using  C-i-i-  and  is  also  in  the  testing 
phase.  Our  ODBC  driver  for  the  semantic  database  engine  is  now  fully  operational  and 
allows  SQL  querying  of  a  semantic  database  and  interoperability  with  relational  database 
tools  such  as  Microsoft  Access’s  Query-by-Example  tool.  Using  these  tools,  the  number  of 
user  keystrokes  required  is  in  correlation  to  the  size  of  the  generated  SQL  program.  Since  the 
SQL  programs  for  the  semantic  database  are  substantially  shorter,  the  third-party  query  tools 
are  much  more  ergonomic  with  the  semantic  database  than  with  the  relational  databases  for 
which  they  were  originally  designed. 

Semantic  SQL  is  detailed  in  the  appendix  entitled  Semantic  SQL. 
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Abstract 

This  paper  proposes  an  efficient  optimistic  concurrency  control  algorithm  and  a  query  optimization 
technique  (lazy  queries)  used  in  a  massively  parallel  B-tree  with  variable-length  keys.  B-trees  with 
variable-length  keys  can  be  effectively  used  in  a  variety  of  database  types.  In  particular,  we  show  how  such 
a  B-tree  is  used  in  our  implementation  of  a  semantic  object-oriented  DBMS.  Our  concurrency  control 
algorithm  uses  semantically  safe  optimistic  virtual  "locks"  that  achieve  very  fine  granularity  in  conflict 
detection.  We  prove  that  the  algorithm  ensures  serializability  and  external  consistency.  Our  algorithm  uses 
local  logical  clocks  and  does  not  require  physical  clock  synchronization.  Lazy  query  execution  algorithm 
is  used  to  reduce  the  client-server  traffic  and  improve  the  granularity  of  concurrency  control  by 
minimizing  the  number  of  optimistic  locks.  Most  relevant  processing  is  done  at  the  client  machines,  thus 
reducing  the  data  and  processing  overheads  at  parallel  B-tree  servers. 


1.  Introduction 

B-tree  data  structures  are  widely  used  in  implementation  of  databases.  B-trees  allow  to  insert, 
delete,  find,  and  retrieve  a  number  of  database  records  [Comer-79].  With  the  advent  of  parallel 
databases  that  can  store  thousands  of  terabytes  of  data  and  object-oriented  technology,  several 
new  properties  of  B-tree  data  structures  are  highly  desirable: 

•  Transparent  access  to  massive  volumes  of  data.  To  index  data  database  systems  use  keys  or 
object  identifiers.  A  B-tree  record  is  usually  divided  into  two  parts:  an  index  part  and  a  data 
part.  The  B-tree  records  are  indexed  lexicographically  by  the  index  part.  The  index  part  in  B- 
trees  is  usually  of  fixed  size  and  the  database  capacity  is  limited.  A  new  generation  of 
databases  (for  example,  [Rishe-92-DDS])  do  not  have  keys  at  all:  the  data  itself  serves  as  a 
key  of  varying  length.  In  a  semantic  database  implementation,  for  example,  each  entity 
comprises  a  large  number  of  strings,  each  of  which  corresponding  to  an  attribute  or 
relationship. 

•  On-line  transaction  processing  systems  demand  high  transactional  and  query  throughput. 
Such  throughput  requires  running  many  transactions  and  queries  in  parallel.  Efficient 
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concurrency  control  and  query  optimization  algorithms  are  necessary  to  resolve  conflicts 
between  concurrent  transactions. 

In  this  paper  we  propose  algorithms  of  a  new  B-tree  structure  that  combines  the  following 
properties: 

•  Parallel  multicomputer  operation.  We  employ  fast  optimistic  concurrency  control. 
Granularity  at  the  level  of  strings  is  attained.  However,  there  is  no  overhead  in  the  physical 
data  structure. 

•  Semantically  safe  optimistic  locks.  We  achieve  greater  degree  of  safety  in  transaction 
conflict  detection  than  in  many  algorithms  that  use  locking.  Even  if  transaction  relies  on 
absence  of  some  data  in  the  B-tree,  and  this  data  was  inserted  by  another  concurrent 
transaction,  a  conflict  will  be  detected. 

•  Efficient  query  execution  algorithm  -  lazy  queries  -  that  gives  faster  execution  of  complex 
queries  and  better  granularity  in  concurrency  control. 

•  Transparent  variable  size  keys.  In  our  B-tree,  the  whole  record  is  a  key  (it  is  up  to  the  B-tree 
client  how  to  divide  this  key  into  index  and  data  portion,  if  that  is  necessary).  A  record  in  our 
B-tree  is  called  a  string. 

•  String  data  compression.  All  strings  in  data  and  index  blocks  in  our  B-tree  are  compressed 
by  eliminating  common  prefixes.  Additionally,  the  index  strings  are  compressed  by 
eliminating  redundant  suffixes.  This  results  in  very  short  index  strings.  Index  eompression 
not  only  reduces  the  storage  requirements  but  also  accelerates  the  B-tree  operations  by 
keeping  more  index  data  in  a  memory  cache. 

•  Automatic  background  load  balancing  that  redistributes  the  data  among  the  database 
computers  to  equalize  the  data  and  transactional  load. 

•  Very  high  capacity.  The  size  of  our  parallel  B-tree  is  virtually  unlimited  (specifically,  it  is 

10^®  bytes  for  64  bit  computers). 

The  rest  of  this  paper  is  organized  as  follows.  Section  2  defines  the  logical  level  of  our  B-tree.  In 
Section  3  we  describe  a  semantic  database  application  that  utilizes  our  B-tree  and  show  some 
typical  database  queries.  Section  4  describes  a  query  optimization  technique  “lazy  queries”  that 
we  use  to  reduce  the  number  of  server  accesses  and  transaction  conflict  probability.  A  parallel  B- 
tree  architecture  is  presented  in  Section  5.  Our  concurrency  control  algorithm  is  presented  in 
Section  6. 

2.  Elementary  B-tree  operations 

We  define  B-tree  as  an  implementation  of  a  data  type,  each  instance  of  which  is  a  lexicographic 
ordered  set  of  strings  with  the  following  operations: 

1 .  Elementary  query  (interval)  operator  [/,  r],  where  /  and  r  are  arbitrary  strings. 

[l,r]S  =  {xeS\l  <x<r},  where  <  is  the  lexicographic  order  of  strings. 

2.  Update  operator.  Let  D  and  I  be  disjoint  sets  of  strings: 

S  +  (I,D)  =  (S-D)  u  I  (I.e.  we  remove  a  set  of  strings  D  and  insert  a  set  I  instead). 
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The  next  section  describes  how  these  elementary  operations  were  used  in  the  implementation  of 
semantic  binary  object-oriented  database. 

3.  Sample  Application:  Semantic  DBMS  implementation 

The  semantic  database  models  in  general,  and  the  Semantic  Binary  Model  SBM  ([Rishe-92- 
DDS]  and  others)  in  particular,  represent  the  information  as  a  collection  of  elementary  facts 
categorizing  objects  or  establishing  relationships  of  various  kinds  between  pairs  of  objects.  The 
central  notion  of  semantic  models  is  the  concept  of  an  abstract  object,  which  is  any  real  world 
entity  that  we  wish  to  store  information  about  in  the  database.  The  objects  are  categorized  into 
classes  according  to  their  common  properties.  These  classes,  called  categories,  need  not  be 
disjoint,  that  is,  one  object  may  belong  to  several  of  them.  Further,  an  arbitrary  structure  of 
subcategories  and  supercategories  can  be  defined.  The  representation  of  the  objects  in  the 
computer  is  invisible  to  the  user,  who  perceives  the  objects  as  real-world  entities,  whether 
tangible,  such  as  persons  or  cars,  or  intangible,  such  as  observations,  meetings,  or  desires. 

The  database  is  perceived  by  its  user  as  a  set  of  facts  about  objects.  These  facts  are  of  three 
types:  facts  stating  that  an  object  belongs  to  a  category;  facts  stating  that  there  is  a  relationship 
between  objects;  and  facts  relating  objects  to  data,  such  as  numbers,  texts,  dates,  images, 
tabulated  or  analytical  functions,  etc.  The  relationships  can  be  of  arbitrary  kinds;  for  example, 
stating  that  there  is  a  many-to-many  relation  address  between  the  category  of  persons  and  texts 
means  that  one  person  may  have  an  address,  several  addresses,  or  no  address  at  all. 

Logically,  a  semantic  database  is  a  set  of  facts  of  three  types:  categorization  of  an  object:  xC, 
relationship  between  two  objects:  xRy,  relationship  between  an  arbitrary  object  and  a  value: 
xRv.  Efficient  storage  structure  for  semantic  models  has  been  proposed  in  [Rishe-91-FS].  The 
collection  of  facts  forming  the  database  is  represented  by  a  file  structure  which  ensures 
approximately  one  disk  access  to  retrieve  any  of  the  following: 

1.  For  a  given  abstract  object  X,  verify/find  what  categories  the  object  belongs  to. 

2.  For  a  given  category,  find  its  objects. 

3.  For  a  given  abstract  object  x  and  relation  R,  retrieve  all  y  such  that  xRy. 

4.  For  a  given  abstract  object  y  and  relation  R,  retrieve  all  abstract  objects  x 
such  that  xRy. 

5.  For  a  given  abstract  object  X,  retrieve  (in  one  access)  all  (or  several)  of  its  categories  and 
direct  and/or  inverse  relationships,  i.e.  relations  R  and  objects  y  such  that  xRy  or  yRx.  The 
relation  R  in  xRy  may  be  an  attribute,  i.e.  a  relation  between  abstract  objects  and 
values. 

6.  For  a  given  relation  (attribute)  R  and  a  given  value  v,  find  all  abstract  objects  such 
that  xRv. 

7.  For  a  given  relation  (attribute)  R  and  a  given  range  of  values  [vy  ,  v^],  find  all 
objects  X  and  v  such  that  xRv  and  V/  <  v  <  v^  . 

We  call  the  operations  1  through  7  elementary  queries.  The  entire  database  can  be  stored  in  a 
single  B-tree.  This  B-tree  contains  all  of  the  facts  of  the  database  (xIC,  xRv,  xRy)  and  also 
additional  information  called  inverted  facts:  CIx,  Rvx,  and  yR/„vX  (Here,  I  is  the  pseudo¬ 
relation  IS-IN  denoting  membership  in  a  category).  The  inverted  facts  allow  to  keep  answers  to 
the  queries  2,  4,  6,  7  in  a  contiguous  segment  of  data  in  the  B-tree  and  answer  them  with  one 
disk  access  (when  the  query  result  is  much  smaller  than  one  disk  block).  The  direct  facts  xIC 
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and  xRy  allow  to  answer  the  queries  1, 3,  and  5  with  one  disk  access.  This  allows  both  sequential 
access  according  to  the  lexicographic  order  of  the  items  comprising  the  facts  and  the  inverted 
facts,  as  well  as  random  access  by  arbitrary  prefixes  of  such  facts  and  inverted  facts.  The  facts 
which  are  close  to  each  other  in  the  lexicographic  order  reside  close  in  the  B-tree.  (Notice, 
that  although  technically  the  B-tree-key  is  the  entire  fact,  it  is  of  varying  length  and  on  the 
average  is  only  several  bytes  long,  which  is  the  average  size  of  the  encoded  fact  xRy.). 

Numeric  values  in  the  facts  are  encoded  as  substrings  using  the  order-preserving  variable-length 
number  encoding  of  [Rishe-91-IB]. 


Table  1  summarizes  how  the  elementary 
semantic  queries  are  implemented  using  the  B- 
tree  interval  operators.  We  use  notation  S  +  1 
to  denote  a  string  derived  from  the  original 
string  S  by  adding  1  to  the  last  byte  of  S  (For 
strings  encoding  abstract  objects,  this 
operation  never  results  in  overflow). 


For  most  elementary  queries  (Queries  1,  3,  4, 

Table  1.  Implementation  of  elementary  queries  number  of  binary  facts  is  usually 

small.  Some  queries  (Queries  2  and  7), 
however,  may  result  in  a  very  large  number  of  facts  and  it  may  be  inefficient  to  retrieve  the 
whole  query  at  once. 

A  common  operation  in  databases  is  to  calculate  an  intersection  of  two  queries.  For  example, 
consider  a  query:  “Find  all  objects  from  category  Student  that  have  the  attribute  BirthYear 
1980”.  This  query  can  be  executed  using  several  scenarios: 

Scenario  1. 

a.  Retrieve  all  persons  born  in  1980:  execute  an  elementary  query  “BirthYear  1980  ?” 

b.  For  each  person  retrieved  in  the  step  a  verify  that  the  person  belongs  to  the  category  Student 

Scenario  2. 

a.  Retrieve  all  persons  bom  in  1980:  execute  an  elementary  query  “BirthYear  1980  ?” 

b.  Retrieve  all  students:  execute  an  elementary  query  “Student  ?” 

c.  Find  an  intersection  of  the  objects  retrieved  in  a  and  b. 

In  Scenario  1  we  retrieve  all  persons  from  all  categories  (Person,  Instmctor,  and  Student)  who 
were  born  in  1980  and  for  each  person  we  execute  an  additional  elementary  query  to  verify  that 
the  retrieved  person  is  a  student.  In  this  scenario  we  have  to  execute  a  large  number  of  small 
queries. 

In  Scenario  2  we  execute  only  two  elementary  queries  and  then  find  an  intersection  of  the  results. 
The  problem  is  that  the  elementary  query  “Student  ?”  may  result  in  a  very  large  set  of  binary 
facts.  Not  only  is  this  very  inefficient  in  terms  of  expensive  communication  between  client  and 
server,  but  also  such  big  query  would  be  affected  by  any  transaction  that  inserts  or  deletes 
students  and  our  query  would  be  aborted  more  often  than  the  query  in  the  Scenario  1 . 


Query 

B-tree  Implementation 

1.x? 

[xl,  xl+  1] 

2.  C? 

[Cl,  CI+  1] 

3.xR? 

[xR,  xR  + 1] 

4.  ?Rx 

[xR,„^,xR,„v+  1] 

5.x?? 

[x,  X  +  1] 

6.  ?Rv 

[Rv,  Rv  + 1] 

7.R[v/..v,]? 

[Rv;,Rv2  + 1] 
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Thus,  Scenario  1  is  obviously  better  in  our  case.  Consider  now  another  query:  “Find  all 
instructors  bom  in  1970”.  The  number  of  persons  born  in  1970  could  be  larger  or  comparable 
with  the  total  number  of  instructors.  In  this  case.  Scenario  2  would  be  much  more  efficient 
because  we  need  to  execute  only  two  elementary  queries. 

The  next  section  introduces  a  technique  of  lazy  elementary  query  execution  that  greatly  reduces 
the  number  of  disk  accesses,  the  server  traffic,  and  the  transaction  conflict  probability  by 
automatically  reducing  one  scenario  to  another.  For  example,  the  intersection  operator  get  a 
close-to-optimal  implementation  without  keeping  any  data  distribution  statistics. 


4.  Lazy  Queries 

In  our  B-tree  the  actual  query  execution  is  deferred  until  the  user  actually  requests  the  query 
results.  We  define  the  elementary  lazy  query  programmatic  interface  in  a  B-tree  B  as  follows: 

1 .  Q  :=[/,/•]  B  -  define  a  lazy  query  [/,  r]  but  do  not  execute  it  yet.  Let  Q.P  be  a  pointer  to 
future  results  of  the  query.  Initially  Q.P"'  :=  ”  ,  i.e.  P  points  to  an  empty  string. 

2.  Seek{Q,  x)  -  Moves  the  pointer  Q.P,  so  that  Q.P^  =  min{y  €  [/,  r]B  |  y  >  x}. 

The  actual  principal  operations  on  the  query  results  are  derived  from  the  above: 

1 .  Read(Q)  ;=  Q.P"'  -  reads  the  current  string  pointed  by  the  logical  pointer  Q.P.  This 
operation  results  in  an  error  if  Q.P  =  null. 

2.  Next(Q)  :=  S'^eA:(Q,  Read(Q)  +  0).  We  use  notation  s  +  0  to  denote  a  string  derived  from 
the  string  s  by  appending  a  zero  byte. 

When  the  Seek  operation  is  executed,  the  string  pointed  to  by  the  new  logical  pointer  is  fetched 
from  the  B-tree  and,  normally,  a  small  number  of  lexicographically  close  strings  is  prefetched 
and  placed  in  a  lazy  query  cache  buffer.  It  is  likely  that  the  next  Seek  operation  will  request  a 
string  whieh  is  already  in  the  cache  buffer,  so  only  a  few  Seek  operations  require  actual  disk  and 
server  access. 
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Many  queries  can  efficiently  use  the  Seek  operation. 
For  example,  we  can  find  the  intersection  of  two  lazy 
queries  Q;  and  very  efficiently:  construct  a  new 
lazy  query  (lazy  intersection)  Qj  where  the  Seek 
operation  uses  algorithm  shown  in  Figure  1. 


This  algorithm  gives  an  efficient  solution  for  the 
sample  queries  described  in  the  previous  section.  For 
the  query  “Find  all  objects  from  category  Student  that 
have  the  attribute  BirthYear  1980”  we  use  three  lazy 
queries: 

a.  Qj  :=  elementary  lazy  query  “BirthYear  1980  ?” 

b.  Q2  :=  elementary  lazy  query  “Student  ?” 

c.  Qs  •=  Qi  &  Q2 

Since  query  Qj  is  not  actually  executed,  our 
algorithm  that  finds  intersection  will  not  require  to 
fetch  every  student  from  the  database:  the  number  of 
actual  disk  accesses  to  retrieve  the  students  in  the  query  will  be  less  than  or  equal  to  the 
number  of  persons  born  in  1980.  Thus,  the  cost  of  the  lazy  query  will  be  smaller  than  the  cost 
of  the  optimal  solution  for  elementary  queries  in  Scenario  1  described  in  the  previous  section. 

For  the  query  “Find  all  instructors  born  in  1970”  we  use  three  similar  lazy  queries.  Since  the 
number  of  instructors  is  likely  to  be  small,  it  is  possible  that  all  instructors  will  be  fetched  in  the 
first  disk  access,  and  the  whole  query  will  require  a  number  of  server  accesses  close  to  2,  which 
is  the  optimal  number. 

Lazy  queries  not  only  result  in  a  smaller  number  of  server  accesses.  We  will  show  that  lazy 
queries  allow  to  improve  the  granularity  of  our  concurrency  control  algorithm  and  reduce  the 
transaction  conflict  probability. 

5.  Parallel  B-tree 

A  massively  parallel  B-tree  should  perform  many  queries  and  transactions  simultaneously  and  its 
size  should  scale  to  hundreds  of  terabytes  even  if  the  underlying  computer  hardware  supports 
only  32  bit  addressing.  This  can  be  achieved  by  splitting  the  B-tree  into  partitions  of  about  1 
gigabyte  in  size.  The  whole  B-tree  is  then  a  network  of  computers  where  each  computer  holds 
one  or  more  B-tree  partitions. 


Seek(Q3,y): 

Seek(Qj ,  x); 

Seek{(^2i^)^ 

while  (Q;  .P  mill  &  Q2  .P  null  & 
Q;.pA^  Q2.I»'^)  do 
if  Q/  .P^  >  Q2  ^.P  then 
Seek(Q2,Q,.P^) 

.  .else  ' 

Seek(Q;,Q2.P^); 

od; 

if  Q;  .P  =  null  or  Q2  .P  =  null  then 
Qj.P:=null 
else 

Q2.P:=Q2.P; 

Figure  1.  Algorithm  to  find  intersection 
Qi  ’=Qi  &  Qi 
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Figure  2:  Client-server  model  of  Parallel  B-tree 


The  B-tree  partitions  themselves  must  be 
indexed.  This  index  is  also  represented  as  a  B- 
tree  which  is  called  a  Partitioning  Map.  The 
Partitioning  Map  B-tree  can  reference 
approximately  200,000,000  B-tree  partitions. 
Since  each  partition  can  hold  up  to  about  1GB 
of  data,  the  total  addressable  B-tree  space  is 
limited  by  200,000  Terabytes.  For  64  bit 

computers  this  limit  becomes  lO’*  times  larger, 

which  is  about  10^^  bytes  and  is  beyond  any 
practical  database  size.  Figure  1  shows  typical 
client-server  model  of  parallel  B-tree. 


6.  Concurrency  Control 

Our  concurrency  control  algorithm  is  an  optimistic  algorithm  that  first  accumulates  a  transaction, 
and  then  performs  it  using  a  2-phase  commit  protocol  [Gray-79],  and  backward  validation 
[Haerder-84]  to  ensure  the  serializability  and  external  consistency  of  transactions.  Our  algorithm 
benefits  from  and  improves  upon  the  validation  technique  of  the  [Adya&al-95]  algorithm  for  an 
object-oriented  database.  Their  algorithm  uses  a  loosely  synchronized  physical  clocks  to  achieve 
global  serialization  and  detects  conflicts  at  the  object  level  granularity.  In  our  algorithm,  a  finer 
granularity  at  the  level  of  strings  is  attained  and  we  used  logical  clocks  to  achieve  global 
serialization;  nevertheless,  our  algorithm  does  not  require  maintaining  any  extra  data  per  string 
or  per  client. 

6- 1,  Accumulation  of  Transactions 

In  a  parallel  B-tree  updates  and  queries  made  by  a  client  should  be  verified  for  conflicts  with 
updates  and  queries  made  simultaneously  by  the  other  B-tree  clients.  A  transaction  is  a  group  of 
B-tree  updates  and  queries  which  is  guaranteed  to  be  consistent  with  the  queries  and  updates 
executed  concurrently  within  other  transactions.  To  create  such  a  group  of  operations  we  have 
several  B-tree  operations  in  addition  to  the  lazy  queries  defined  in  Section  4: 

1 .  Insert  String  x 

2.  Delete  String  x 

3.  Transaction  Begin 

4.  Transaction  End 

A  transaction  is  a  sequence  of  all  lazy  queries  and  updates  (Operations  1,2)  executed  between  the 
Transaction  Begin  and  Transaction  End.  When  the  Transaction  End  is  executed,  all  queries  and 
updates  made  since  the  Transaction  Begin  are  checked  for  conflicts  with  the  queries  and  updates 
made  by  concurrent  transactions.  If  there  is  a  conflict,  the  transaction  is  aborted  and  the 
Transaction  End  returns  an  error. 
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The  updates  made  within  a  transaction  do  not  change  the  B-tree  immediately.  Instead,  these 
updates  are  accumulated  at  the  client  side  in  a  set  of  inserted  strings  I  and  a  set  of  deleted  strings 
D.  The  B-tree  strings  remain  unaffected.  The  insert  and  delete  operations  work  as  follows: 

msert(x:)  =  {  D  :=  D  -  {jc};  I  :=  I  u  {x}  } 
delete(jc)  =  { I  :=  I  -  {x};  D  :=  D  u  {x}  } 

When  Transaction  End  is  executed,  the  set  D  is  deleted  from  the  B-tree  and  the  set  I  is  inserted 
into  B-tree: 

B  :=  (B  -  D)  u  I 

During  the  accumulation  of  a  transaction  into  sets  D  and  I,  our  concurrency  control  algorithm  at 
the  client  also  accumulates  a  set  V  to  be  used  for  backward  validation.  The  set  V  contains  the 
specification  of  each  subinterval  read  by  a  query  within  the  transaction  and  a  timestamp  of  this 
reading.  A  subinterval  is  a  subrange  within  a  query  which  subrange  was  physically  retrieved 
from  one  database  partition  at  one  logical  moment  in  time.  The  logical  time  at  a  given  database 
partition  is  incremented  every  time  a  committed  transaction  physically  changes  that  partition. 
The  subintervals  are  stamped  with  this  logical  time  and  a  number  that  identifies  the  partition  in 

the  system.  Thus  the  set  V  is  {([4,  r*],  4,  p*  )a=i},  where  4  is  the  timestamp  and  /»*  is  the 
partition  number. 

In  our  validation  technique,  when  committing  a  transaction  T,  the  system  does  not  need  to 
remember  the  results  of  T’s  queries;  it  remembers  only  query  specifications  [/,  r],  which  are 
checked  against  concurrent  transactions  at  T’s  commit  time.  The  validation  is  done  against 
transaction  queues,  normally  without  any  disk  access. 

Lazy  queries  can  be  used  to  further  reduce  the  validation  specified  by  the  set  V  and  improve  the 
granularity  in  conflict  detection.  Previous  examples  have  shown  that  the  user  does  not  actually 
retrieve  all  facts  from  the  lazy  query  interval.  The  intersection  of  lazy  queries  uses  the  Seek 
operation  and  actually  retrieves  only  a  few  strings  from  the  original  elementary  queries.  In  our 
implementation,  a  lazy  query  automatically  keeps  track  of  those  string  subranges  that  have 
actually  been  by  the  user.  This  union  of  subranges  can  be  much  smaller  than  the  union  of  the 
original  elementary  query  intervals.  This  results  in  a  finer  transaction  granularity  and  smaller 
conflict  probability.  At  the  end  of  transaction  execution,  the  string  subranges  from  all  lazy 
queries  are  further  optimized  by  merging  intersecting  subranges  of  all  lazy  queries.  This 
optimization  is  done  at  the  client  side,  which  allows  us  to  reduce  the  server  load  and  the 
transaction  execution  time. 

An  accumulated  transaction  is  a  triple  T(l,  D,  V)  of  strings  to  be  inserted  I,  strings  to  be  deleted 
D,  and  string  intervals  V  to  be  verified. 

Note  that  even  if  no  updates  were  made,  a  transaction  is  still  necessary  to  ensure  the  consistency 
of  queries.  Thus,  a  query  can  produce  an  accumulated  transaction  T(I,  D,  V)  with  empty  sets  D 
and  I. 
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6.2.  Validation  Protocol 

Validation  is  necessary  to  ensure  two  important  properties  of  transactions:  serializability  and 
external  consistency.  Serializability  means  that  the  committed  transactions  can  be  ordered  in 
such  a  way  that  the  net  result  would  be  the  same  as  if  transactions  ran  sequentially,  one  at  a  time. 
External  consistency  means  that  the  serialization  order  is  not  arbitrary:  if  transaction  S 
committed  before  T  began  (in  real  time),  S  should  be  ordered  before  T. 

When  a  client  commits  a  transaction,  the  accumulated  transaction  T  is  delivered  to  one  of  the 
database  servers.  This  database  server  is  called  the  transaction’s  originator.  The  transaction 
originator  splits  the  arriving  transaction  into  subtransactions  Ti  according  to  the  partitioning  map 
and  distributes  the  subtransactions  among  the  database  partitions.  A  subinterval  ([4,  r*],  4,/?*) 
in  the  set  V  is  distributed  to  the  partition  (without  consulting  the  partitioning  map).  This 
allows  to  detect  conflicts  with  system  transactions  that  perform  load  balancing,  which  may 
change  the  partitioning  map. 

The  transaction  originator  uses  the  2-phase  commit  protocol  to  update  the  database.  In  the  first 
phase,  the  transaction  originator  distributes  the  subtransactions  among  the  database  partitions. 
Each  database  partition  verifies  that  no  conflicts  with  any  other  transaction  is  possible  and  sends 
a  “ready”  or  “failed”  message  to  the  transaction  originator.  If  the  transaction  originator  receives 
a  “failed”  message,  it  immediately  aborts  the  other  subtransactions  and  notifies  the  client.  When 
all  database  partitions  return  a  “ready”  message,  the  transaction  originator  sends  a  “commit” 
message  to  the  participating  partitions. 

In  a  backward  validation  protocol,  the  arriving  subtransaction  T,(I,,D,-,Vj)  is  checked  against  all 
transactions  already  validated  successfully.  In  our  B-tree,  each  partition  maintains  a  log  of 
recently  committed  transactions  CL  and  a  log  of  transactions  waiting  for  commit  WL. 

We  say  that  a  set  of  string  intervals  V  intersects  a  set  of  strings  A  iff  there  exists  an  interval  [/,  r] 
in  V  such  that  [/,/■]  A^0  (i.e.  for  some x  e  A:  /  <x  <r  ). 

We  also  say  that  two  transactions  T(l7-,  Dj-, \j)  and  S(l5,  D^,  V^)  intersect  if: 

1 . 1 j'  ^  0  or  D  f  ^  0 

or 

2.  V5  intersects  If  u  Dj- 
or 

3.  Vj-  intersects  1^  u 

When  the  subtransaction  T,-  arrives,  it  is  verified  that  T,-  intersects  with  no  transaction  S  in  WL. 
Additional  verification  is  necessary  to  ensure  that  no  query  in  T,-  is  affected  by  a  recently 
committed  transaction  S  in  CL.  We  check  that  each  interval  ([4,  r*],  4,/i*)  in  V,-  of  T,-  does  not 
intersect  with  the  sets  and  of  any  transaction  S  in  CL  that  has  greater  timestamp  than  4 . 

If  the  subtransaction  is  successfully  verified,  it  is  appended  to  the  WL  and  the  “ready”  message 
is  sent  to  the  transaction  originator,  otherwise  the  “failed”  message  is  sent  to  the  transaction 
originator. 
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While  normally  not  requiring  any  disk  access,  this  algorithm’s  CPU  time  is 

0(||V,||x  (||WL||  +  ||CL|V,||))  where  CL|V,.  are  the  committed  transactions  younger  than  an 

average  query  in  V,  .  The  algorithm  can  be  significantly  accelerated  by  merging  small  sets  in  CL 
and  WL. 

7.  Proof  of  Correctness 

Our  concurrency  control  algorithm  satisfies  both  serializability  and  external  consistency 
requirements.  Consider  two  arbitrary  transactions  T  and  S.  Any  two  subtransactions  T,-  and  S*  of 
T  and  S  for  different  partitions  in  our  B-tree,  are  disjoint,  i.e.  T,-  does  not  intersect  S*  for  all  i  ^ 

k.  Because  of  this  and  the  definition  of  intersection,  T  and  S  intersect  iff  for  some  partition  i  T,- 
intersects  S,- . 

Definition:  Relation  "tt  "  between  transactions  that  have  been  successfully  performed  in  our 
system:  T  Tt  S  if  any  of  the  following  conditions  (a)  and/or  (b)  holds: 

(a)  There  exists  a  partition  i  such  that  T,-  intersects  S;  and  T,-  entered  CL,-  before  S< . 

(b)  The  programmatic  transaction  that  generated  S  began  execution  at  the  client  after  the 
commit  of  T  had  been  acknowledged  to  the  user,  i.e.  T  completed. 

Lemma  1.  The  relation  Tt  is  acyclic. 

Consider  an  arbitrary  number  n  of  transactions  and  assume  that  there  is  a  cycle: 

71  Tt  ...  Tt  T"  Tt 

Choose  an  arbitrary  partition  j  where  T’j  has  entered  ClJ.  Let  tg  be  the  physical  moment 
of  time  of  said  entry.  Relation  T"  Tt  implies  one  of  two  cases  at  time  to : 

l.  Condition  (b):  The  transaction  T"  is  already  committed. 

2.  Condition  (a):  There  exists  a  partition  i  such  that  T",  intersects  T^,  and  T",  entered  CL,-  before 
TV 

In  both  cases  there  exists  a  partition  of  T"  in  which  the  physical  moment  of  time  when  the 
subtransaction  of  T"  enters  CL  is  less  than  Iq. 

In  case  1,  T"  is  already  committed  and  all  subtransactions  of  T"  are  in  CL  by  the  time  to.  In  case 
2,  the  fact  that  enters  CLy  at  the  time  to  and  the  2-phase  commit  protocol  implies  that  the 
other  subtransactions  of  T^  must  be  either  in  WL  or  CL  at  the  moment  to.  Thus,  the 
subtransaction  T^,  must  be  in  WL  or  CL  at  the  moment  to.  According  to  our  backward 
validation  protocol  T",-must  be  in  CL  at  this  time  (otherwise  if  T",-  is  in  WL,  the  intersecting 
subtransaction  T^,-  would  be  aborted). 

Repeating  the  same  argument  n  times  for  the  relations  T"‘^  tt  T",  ...,  T^  tt  T^  we  conclude  that 
there  exists  a  partition  k  of  T^  in  which  is  already  in  CL^  before  time  to . 
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Thus,  for  every  partition  j  of  there  exists  a  partition  k  in  which  enters  CL  earlier  than  T^j 
enters  CL^ .  Since  has  a  finite  number  of  partitions  this  implication  is  false,  in  contradiction  to 
the  initial  assumption. 

Therefore,  "  n  "  is  acyclic.  I 

Theorem;  Our  schedule  is  serializable  and  externally  consistent. 

Proof: 

Consider  an  arbitrary  set  of  transaction  that  have  successfully  completed  in  the  system.  By  the 
previous  Lemma,  "  n  "  imposes  a  partial  order  on  them.  According  to  the  known  theorem  that 
"for  every  partial  order  there  exists  a  total  order  that  preserves  the  partial  order",  there  exists  a 
total  order  "TTTt"  preserving  "n  ".  The  order  "ttti"  defines  a  serialization  of  transactions 
T^,...,T'".  The  series  preserves  external  consistency  because  "n"  does,  by  its  definition.  It 
remains  to  show  that  every  transaction  T*  in  the  series  saw  the  database  in  a  state  after  all  the 
previous  transactions  and  before  any  subsequent  transactions  in  the  series,  i.e.  every  query  [l,r\ 
in  V*  had  resulted  in 

(1)  [/,r]2;(r,D') 

i<k 


Consider  an  arbitrary  query  ([/,/*],  t,  p)  in  .  We  need  to  prove  that 


(2)  [/,'•]  E  (i;,d;) 

/:T  '  updated  i  <  k 

before  t 


The  left  sum  is  what  the  query  [/,r]  actually  saw  at  the  partition  p,  while  the  right  sum  is  what  it 
should  have  seen  if  serializability  requirement  is  fulfilled.  From  the  set-theoretical  definitions  of 
queries  and  transactions  it  follows  that: 

[hr](A  +  B)  =  [l,r]A  +  [l,r]B. 

Therefore,  (2)  is  equivalent  to: 

(3)  E  [/,r](i;,D;)  =  2 

/:T  '  iipd  ated  i  <  k 

before  / 


Assume  that  (3)  is  not  true.  Consider  two  cases: 

Case  1. 

The  left  sum  in  (3)  has  an  extra  addend  [l,r]  0,  k  <  j,  i.e.  the  query  [l,r]  may  have 

seen  the  results  of  transaction  T'  that  was  serialized  after  T*.  This  implies  that  T*  intersects  T' 
and  T'  entered  CLp  before  the  moment  t  and,  therefore,  before  T*. 
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Since  the  total  order  relation  7i7t  preserves  the  partial  order  relation  7t ,  T*  TtTt  T'  implies 
not(T^  71  T*).  By  the  definition  of  relation  7t  ,  this  in  turn  implies  that  there  is  no  partition  p 
such  that  T'  intersects  T*  and  T'  entered  CL^  before  T*.  Thus,  we  have  a  contradiction  and  this 
case  can  never  occur. 

Case  2.  The  left  sum  does  not  have  some  addend  [/,/■]  (I-'p,  D^p)  0,  j  <  k,  i.e.  T*  did  not  see 
some  updates  of  transaction  T'  ordered  before  T*.  Like  in  the  previous  case,  T'  7i7t  T*  implies 
not(T*  71  T  0  and,  therefore,  there  is  no  partition  p  such  that  T*  intersects  T'  and  T*  entered 
CLp  before  T' .  Since  [l,r\  (I^p,  D ^p)  0,  T'  intersects  T*  at  the  partition p.  Therefore,  T'  should 

have  entered  CLp  before  T*  .  Thus,  there  is  a  moment  when  T'  is  in  CLp  and  T*  is  in  WLp. 
Since  the  left  sum  does  not  have  the  addend  (I  ^  D  ^  must  have  entered  CLp  after  the 
moment  t.  When  T*  is  checked  by  the  backward  validation  protocol  at  the  partition  p,  according 
to  our  query  verification  algorithm,  each  query  ([/,  r]  t,  p)  of  T*  should  be  verified  against  all 
transactions  in  CLp  with  timestamp  greater  than  t.  Since  [/,r](I  ^p,  D  ^p)  5^  0  for  some 

transaction  T',  the  verification  fails  and  T*  is  aborted.  Thus,  in  our  case  T*  would  have  been 
aborted,  and  we  have  a  contradiction  to  our  assumption  that  T*  is  a  committed  transaction. 

Thus,  the  equation  (3)  is  valid  and  our  schedule  of  transactions  is  serializable.  I 

8.  On  Load  Balancing 

A  query  of  transaction  execution  time  is  determined  by  its  slowest  subquery  or  subtransaction. 
Thus,  load  balancing  is  essential  in  our  system  to  equalize  the  data  and  transactional  load  among 
B-tree  partitions.  In  our  system,  load  balancing  is  performed  as  a  series  of  load  balancing 
transactions  that  transfer  small  string  intervals  from  one  partition  to  another  [Rishe&al-96]. 

Load  balancing  transactions  require  a  large  amount  resources  and  time  to  move  the  data.  Thus, 
the  load  balancing  transactions  should  be  well  coordinated  in  the  system.  Since  the  data 
movement  time  is  large,  a  centralized  algorithm  performs  well  without  becoming  a  system 
bottleneck.  In  our  B-tree,  a  centralized  load  balancing  module  periodically  collects  data  and  load 
distribution  statistics  from  all  partitions  and  heuristically  generates  a  data  distribution  policy, 
initiating  the  load  balancing  transactions  executed  at  the  B-tree  partitions.  Apart  from  the  load 
balancing  transactions,  the  load  balancing  module  can  create  a  new  partition  or  delete  an  empty 
partition  which  is  no  longer  referred  to  by  any  other  B-tree  partition.  Our  concurrency  control 
algorithm  maintains  its  safety  and  efficiency  notwithstanding  load  balancing  activity  done 
according  to  our  algorithm  of  [Rishe&al-96]. 

9.  Client  String  Cache 

The  query  performance  can  be  improved  even  further  by  using  a  client  B-tree  string  cache  which 
accumulates  all  strings  recently  retrieved  from  all  database  partitions.  This  allows  to  perform 
many  frequently  executed  queries  in  zero  server  accesses.  The  B-tree  cache  stores  all  strings 
retrieved  from  the  database  servers  along  with  the  timestamp  attribute  and  the  partition  number 
for  each  string. 

Since  the  attributes  of  an  object  constitute  lexicographically  close  strings,  it  is  likely  that  after 
requesting  a  string  the  client  will  later  request  lexicographically  close  strings.  Thus,  the  cache 
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performance  can  be  improved  by  prefetching  more  than  one  B-tree  block  that  contain  the 
requested  query. 

The  client  cache  significantly  improves  the  performance  of  our  optimistic  concurrency  protocol. 
When  a  client  transaction  fails  due  to  a  conflict  ^vith  another  transaction  which  affects  some 
queries,  a  list  of  failed  (affected)  queries  is  returned  to  the  client.  All  failed  string  ranges  are 
invalidated  in  the  client  cache,  and  when  the  transaction  is  executed  again,  the  client  retrieves  the 
invalidated  ranges  from  the  server.  When  the  transaction  runs  the  second  time,  most  of  the  other 
ranges  that  the  user  retrieves  are  already  in  the  string  cache  and  the  number  of  server  accesses  is 
much  smaller.  This  can  significantly  improve  the  transaction  execution  time  at  the  second 
attempt.  Since  the  conflict  probability  is  proportional  to  the  execution  time,  the  conflict 
probability  is  also  reduced. 

10.  Livelocks  and  Exceptional  Pessimistic  Locks 

In  rare  situations,  a  transaction  may  be  repeatedly  aborted  and  reexecuted,  causing  a  livelock.  To 
avoid  this,  we  introduce  a  pessimistic  element  in  our  concurrency  control.  When  a  privileged 
user  desires  to  protect  a  new  transaction,  the  user  can  also  designate  it  as  pessimistic. 

A  transaction  in  the  pessimistic  mode  has  a  unique  logical  identifier  assigned  by  the  system.  This 
identifier  has  smaller  value  for  older  transactions  or  transactions  with  higher  priority.  A 
transaction  executed  in  the  optimistic  mode  is  “assigned”  an  infinitely  large  logical  identifier. 

A  transactional  query  in  a  pessimistic  mode  does  not  use  cache,  but  is  sent  directly  to  the 
corresponding  database  partitions.  Each  partition  maintains  an  additional  log  called  PL  which 
logs  all  pessimistic  transactional  queries  as  soon  as  they  are  executed.  Whenever  a  new 
transaction  T  conflicts  with  a  query  in  PL  that  has  a  smaller  logical  identifier  than  T,  T  is 
aborted.  To  ensure  that  a  queiy  in  pessimistic  transaction  does  not  conflict  with  any  transaction 
which  is  ready  to  commit,  the  query  execution  should  be  delayed  until  there  is  no  transaction  in 
WL  that  conflicts  with  the  query.  When  an  accumulated  pessimistic  transaction  T(I,D,V) 
arrives,  its  verification  should  be  delayed  until  there  is  no  transaction  in  WL  that  conflict  with 
the  sets  D  or  I.  A  query  is  removed  from  PL  when  its  transaction  arrives  for  verification  or  after 
a  timeout  period,  whichever  happens  first. 

This  algorithm  guarantees  successful  execution  of  the  transaction  with  the  smallest  logical 
identifier. 

11.  Conclusion 

In  this  paper  we  described  our  efficient  concurrency  control  algorithm  that  we  used  in 
implementation  of  a  parallel  B-tree  server.  This  algorithm  has  very  high  granularity  while 
avoiding  high  storage  and  processing  time  overheads.  Our  algorithm  uses  logical  clocks  and 
does  not  require  physical  clock  synchronization.  The  B-tree  server  can  handle  variable  length 
keys  and  can  be  used  in  a  variety  of  databases,  including  relational,  object-oriented,  and 
semantic.  Many  of  our  B-tree  features  (data  compression,  concurrency  control,  lazy  queries) 
have  been  implemented  in  C++  and  tested.  Preliminary  results  demonstrated  very  good 
performance. 

Lazy  queries  in  a  B-tree  can  significantly  improve  server  performance  in  case  of  complex  and 
large  queries.  Lazy  queries  can  also  decrease  the  transaction  conflict  probability,  which  is 
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essential  for  on-line  transaction  processing  systems  where  high  contention  workloads  are 
common.  Our  algorithm  has  a  very  fine  granularity  (attribute  or  string  level  granularity),  which 
also  contributes  to  smaller  transaction  conflict  probability. 
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We  ore  developing  a  massively  parallel  semantic  database  machine.  Our  basic  semantic  storage  stnu’ture  ensures  balanced  load  for  most 
parts  of  the  database.  The  load  to  the  other  parts  of  the  database  is  kept  balanced  by  a  heuristic  algorithm  which  repartitions  data  among 
processors  in  f>ur  database  machine  as  necessary'  to  produce  a  more  evenly  balanced  load.  We  present  our  ine.xpensive.  dynamic  load  brdanc- 
ing  method  together  with  a  fault-  tolerant  data  tran.sfer  policy  that  wdl  be  used  lo  transfer  the  repartitioned  data  in  a  wav  transparent  to  the 
users  of  the  database. 

Keywords:  DBMS,  massive  parallelism,  semantic  data  models,  load  balancing,  database  machine 


L  INTRODUCTION 

Database  management  systems  arc  emerging  as  prime  targets 
for  enhancement  through  parallelism.  In  order  for  parallel 
database  machines  lo  be  efficient,  the  processors  in  the  sys¬ 
tem  must  have  comparable  load.  A  massively  parallel 
database  machine  which  uses  thousands  of  processors  will 
allow  for  massive  throughput  of  transactions  and  queries  if 
no  processcKs  become  a  bottleneck.  This  paper  proposes  a 
load  balancing  method  for  a  massively  parallel  semantic 
database. 

Much  work  on  load  balancing  for  relational  databases  and 
file  systems  has  been  done  and  can  be  utilized  in  our 
research.  For  example,  Sitaram  et  «/.  ‘  propose  several 
dynamic  load  balancing  policies  for  multi-server  file  sys¬ 
tems.  A  dynamic  load  balancing  algorithm  for  large,  shared- 
nothing,  hypercubc  database  computers  which  makes  use  of 
relational  Join  strategies  is  presented  in  Hua  and  Suv  Lee 
and  Hua-^  pre.sent  a  self-adjusting  data  distribution  scheme 
which  balances  computer  workload  in  a  multiprocessor 
database  system  at  a  ceil  level  during  query  processing.  A 
run-time  reorganization  scheme  for  rule  based  proce,ssing  in 
large  databases  is  discussed  in  Stolid  et  aid. 

Our  database  computer  will  make  use  of  a  shared-nothing 


architecture.  The  computational  load  on  each  processor  of 
our  database  computer  will  vary  directly  with  the  demand  for 
data  on  that  prtx'essor.  Imbalances  in  the  number  of  data 
accesses  among  nodes  can  be  rectified  by  repartitioning  the 
daiaba.se.  much  as  imbalances  in  computational  demands  in 
process  scheduling  can  be  rectified  by  moving  processes 
from  one  machine  to  another.  When  a  range  of  facts  in  our 
database  is  moved  from  one  processor's  control  to  another 
processor's  control,  the  load  on  the  first  processor  will  go 
down.  The  methods  for  determining  imbalances  in  our  sys¬ 
tem,  and  the  methods  to  relieve  these  imbalances  in  our  sys¬ 
tem.  are  very  similar  to  the  methods  used  for  computational 
dynamic  load  balancing  in  shared-nothing  computers.  An 
adaptive,  heuristic  method  for  dynamic  load  balancing  in  a 
message-passing  multicomputer  is  presented  in  Xu  and 
HwangL  A  semi-distributed  approach  to  load  balancing  in 
massively  parallel  multicomputer  systems  is  presented  in 
Ahmad  and  Ghafoor^. 

Our  massively  parallel  database  machine  architecture  makes 
use  of  a  distributed  system  of  many  processors,  each  with  its 
own  pennanent  storage  device.  This  shared-nothing  approach 
requires  that  any  load  balancing  operations  be  perfonned  by 
messairc  passing.  The  data  distribution  scheme  that  is  used  in 
our  database’  system  allows  load  balancing  to  be  achieved  by 
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data  rcpaniiioning  among  the  nodes  of  our  system. 

This  paper  refines  the  results  reported  in  Rishic  et  alJ  and 
extends  them  by  adding  a  fault  tolerant  data  transfer  policy 
for  data  rcpariilioning. 

2.  SEMANTIC  BINARY  DATABASE 
MODEL 

The  semantic  database  !m)dels  in  general,  and  the  Semantic 
Binary  Model  SBM  (Rishie^  and  others)  in  particular,  repre¬ 
sent  the  information  of  an  application's  world  as  a  collection 
of  elementary  facts  categorizing  objects  or  establishing  rela¬ 
tionships  of  \  arious  kinds  between  pairs  of  objects.  The  cen¬ 
tral  notion  of  semantic  models  is  the  concept  of  an  af?srract 
(ihjcct.  which  is  any  real  world  entity  that  we  wish  to  store 
informaii(m  about  in  the  database.  The  objects  are  catego¬ 
rized  into  classes  according  to  their  common  properties. 
These  classes,  called  ccitc^ories,  need  not  be  disjoint  -  that 
IS.  one  object  may  belong  to  several  classes.  Further,  an  arbi¬ 
trary  structure  of  sub-categories  and  super-categories  can  be 
defined.  The  representation  of  the  objects  in  the  computer  is 
in\isible  ti*  the  user,  who  perceives  the  objects  as  real-world 
entities,  whether  tangible,  such  as  persons  or  cars,  or  intangi¬ 
ble.  such  as  (Observations,  meetings,  or  desires.  The  database 
is  perceived  by  its  user  as  a  set  of  facts  about  objects.  These 
facts  are  of  three  types;  facts  slating  that  an  object  belongs  to 
a  category:  .xC:  facts  staling  that  there  is  a  relationship 
between  objects:  .xRx:  and  facts  relating  objects  \o  data,  such 
as  numbers,  texts,  dates,  images,  tabulated  or  analytical 
functions,  etc:  .xRv.  The  relatitonships  can  be  of  arbitrary 
kinds:  stating,  t(or  e.xample.  that  there  is  a  many-i(o-many 
relation  address  between  the  category  of  persons  and  texts 
means  that  one  person  may  have  an  address,  several  address¬ 
es.  or  no  address  at  all. 

3.  STORAGE  STRUCTURE 

.^n  efficient  storage  structure  for  semantic  models  has  been 
proptosed  in  Rishie^’-  The  collection  of  facts  forming  the 
database  is  represented  by  a  file  structure  which  ensures 
approximately  1  disk  access  to  retrieve  queries  of  any  of  the 
following  forms: 

1.  For  a  given  abstract  object  x,  verify/find  what  cate¬ 
gories  the  object  belongs  to. 

2.  For  a  given  category,  find  its  objects. 

3.  For  a  given  abstract  object  x  and  relation  R.  retrieve 
all/certain  y  such  that  xRy. 

4.  For  a  given  abstract  object  y  and  relation  R,  retrieve 
all/certain  abstract  objects  x  such  that  xRy. 

5.  For  a  given  abstract  object  x.  retrieve  (in  one 
access)  all  (or  several)  of  its  direct  and/or  inverse 
relationships,  i.e.  relations  R  and  objects  y  such  that 
xRy  or  yRx.  The  relation  R  in  xRy  may  be  an 
attribute,  i.e.  a  relation  between  abstract  objects  and 
concrete  objects. 

6.  For  a  given  relation  (attribute)  R  and  a  given  con¬ 
crete  object  y,  find  all  abstract  objects  such  that 
xRy. 


7."^  For  a  given  relation  (attribute)  R  and  a  given  range 

of  concrete  objects  [vj,  vy],  find  all  objects  x  and  y  s 
such  that  xRy  and  y  <  V]  S  yy.  | 

i 

The  entire  database  can  be  stored  in  a  single  file.  This  file  | 
contains  all  of  the  facts  of  the  database  (aC  and  xRy)  as  well  « 
as  additional  information  called  inverted  facts:  Cv.  Ry.x.  The  | 
inverted  facts  ensure  that  answers  to  queries  of  forms  2.  4.  6  | 
and  7  are  kept  in  a  contiguous  segment  of  data  in  the  f 
database  which  allows  them  to  be  answered  with  one  disk  I 
access.  The  direct  facts  xC  and  xRy  allow  the  database  to  t 
answer  queries  of  forms  1.  3,  and  5  with  one  disk  access.  | 
The  file  is  maintained  as  a  B-tree.  The  variation  of  the  B-tree  # 
used  allows  both  .sequential  access  according  to  the  lexico-  | 
graphic  order  of  the  items  comprising  the  facts  and  the  I 
inverted  facts,  as  well  as  random  access  by  arbitrary  prefixes  | 
of  such  facts  and  inverted  facts.  Facts  which  are  close  to  ^ 
each  other  in  the  lexicographic  order  reside  close  to  each  { 
other  in  the  file.  (Notice  that  although  technically  the  B-trec-  s 
key  is  the  entire  fact,  it  is  of  varying  length  and  on  the  aver-  | 
age  is  only  .several  bytes  long,  which  is  the  average  size  of  J 
the  encoded  fact  xRy.)  i 

Consider,  for  example,  a  database  containing  information  i 
regarding  products  manufactured  by  different  companies. 
The  following  set  of  facts  can  be  a  part  of  a  Ic^gical  instanta¬ 
neous  database: 

1.  ob\Qc{\  COMPANY 

2.  object  1  COMPANY-NAME  'IBM' 

3.  object  1  MANUFACTURED  ob}Qf:[2 

4.  object  1  MANUFACTURED  objecl3 

5.  object:  PRODUCT 

6.  object:  COST  3600 

7.  object:  DESCRIPTION  ‘Thinkpad’ 

8.  objeci3  PRODUCT 

9.  object3  COST  100 

10.  object:  DESCRIPTION  ‘TrackPoini' 

The  additional  inverted  facts  stored  in  the  database  are:  ,• 

1.  COMPANY  objccil  \ 

2.  COMPANY-NAME 'IBM' ob}cci\  ] 

3.  object:  MANUFACTURED-BY  object  1  [ 

4.  object:  MANUFACTURED-BY  object  1  \ 

5.  COST  3600  object:  [ 

6.  Ca57’ 100  object:  ; 

7.  DESCRIPTION  Thinkpad*  object:  j 

8.  DESCRIPTION  ‘TrackPoinF  object:  i 

9.  PRODUCT  ob]tca  | 

10.  PRODUCT  0b]tci2 

To  answer  the  elementary  query  “Find  all  objects  manufac-  : 
lured  by  object  1",  we  find  all  the  facts  whose  prefix  is  : 
ob']tc\\ _M AN U FACTORED.  denotes  concatenation.)  ; 

These  entries  are  clustered  together  in  the  sorted  order  of  | 
direct  facts.  ; 

To  answer  the  elementary  query  “Find  all  products  cost-  | 
ing  between  $0  and  $800".  we  find  all  the  facts  whose  prefix  I 
is  in  the  range  from  COST_0  to  C(7Sr_800.  These  entries  ^ 
are  clustered  together  in  the  sorted  order  of  inverted  facts.  | 
In  the  massively  parallel  version  that  we  are  developing,  I 
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the  B“tree  is  partitioned  into  many  small  fragments,  each 
residing  on  a  separate  storage  unit  {e.g.  a  disk  or  non-volatile 
memory)  that  is  associated  with  a  fairly  powerful  processor. 
This  disk-processor  pair  is  called  a  node.  Each  node  can 
retrieve  information  from  the  disk,  perform  the  necessary 
processing  on  the  data  and  deliver  the  result  to  the  user,  or  to 
the  other  nodes.  For  updates  the  node  verifies  all  of  the  rele¬ 
vant  integrity  constraints  and  then  stores  the  updated  infor¬ 
mation  on  the  disk.  Many  database  fragments  can  be  queried 
or  updated  concurrently. 

The  queries  and  transactions  will  enter  into  the  network 
through  host  interfaces.  Every  host  interface  maintains  a 
copy^of  the  Partitioning  Map  (PM)  of  the  entire  database. 
Since  the  whole  database  is  a  lexicographically  ordered  file 
represented  by  a  set  of  B-irees.  the  map  needs  to  contain 
only  a  small  number  of  facts  for  each  node:  the  lexicographi¬ 
cally  minimal  and  maximal  facts  for  each  B-tree  fragment 
that  is  stored  on  that  node.  The  map  changes  only  when  the 
database  is  re-partitioned.  The  distribution  policy  that  we 
propose  in  this  work  provides  repartitioning  that  is  rare, 
inexpensive,  hxali/.able,  invisible  to  the  system  until  all  of 
the  shifting  ol  data  is  complete,  and  that  does  not  interfere 
with  the  normal  operation  of  the  system. 

Most  of  the  physical  facts  that  are  in  our  implementation 
of  a  semantic  binary  database  start  with  an  abstract  object. 
These  facts  are  ordered  by  the  encoding  of  the  abstract 
objects,  which  assigns  a  unique  quasi-random  number  to 
each  abstract  t)biect.  Since  there  are  so  many  of  these  facts, 
and  since  tlie  objects  are  randomly  ordered,  we  can  assume 
that  traffic  to  each  partition  of  these  facts  will  be  balanced 
over  time.  Other  lacts  in  a  semantic  binary  database  start 
with  an  inverted  category  or  an  inverted  attribute  (i.e.  a  rela¬ 
tion  between  an  abstract  object  and  a  printable  value).  It  is 
possible  that  at  some  time  there  may  be  a  need  to  access  a 
certain  attribute  or  category  more  often  than  other  aitrihute.s 
or  categories.  The  same  may  he  true  lor  a  specific  range  of 
values  of  a  given  aiirihute.  Since  a!!  facts  that  refer  to  a  par¬ 
ticular  inverted  attribute  or  inverted  category  are  clustered 
together,  this  may  cause  a  higher  load  on  some 
processor/disk  pairs  than  on  others.  Since  load  imbalances 
can  occur  in  some  kinds  of  facts  hut  not  others,  the  tile  con¬ 
taining  the  facts  will  be  split  into  two  subliles.  The  first  sub¬ 
file  will  contain  all  the  facts  that  begin  with  an  abstract 
object.  The  second  subfile  wilt  contain  the  lacts  that  begin 
with  an  inverted  aiirihute  or  caicgt^ry.  Additionally  there  is  a 
third  subfile  ctmtaining  long  data  items:  texts,  images,  etc., 
which  are  pointed  to  by  facts.  Each  subfile  will  be  initially 
partitioned  evenly  over  all  the  processor/disk  pairs  in  the 
system.  The  first  subfile  is  already  balanced;  the  second  and 
third  subfiles  may  become  unbalanced  and  will  require  a 
block  placement  algorithm  that  allows  the  data  to  be  reparti¬ 
tioned.  By  rcparlilioning  data,  we  will  be  able  to  more  even¬ 
ly  balance  the  load  to  each  data  partition. 

4.  REQUEST  EXECUTION  SCHEME 

We  employ  a  deferred  update  scheme  for  tran.saction  pro¬ 
cessing.  This  means  that  transactions  arc  not  physically  per- 
Ibrmcd  until  they  are  committed,  but  are  accumulated  by  the 
database  management  system  as  they  are  run.  Upon  comple¬ 


tion  of  the  transaction  the  DBMS  checks  its  integrity  and 
then  physically  performs  the  update.  A  completed  transac¬ 
tion  is  composed  of  a  set  of  facts  to  be  deleted  from  the 
database,  a  set  of  facts  to  be  inserted  into  the  database,  and 
additional  information  needed  to  verify  that  there  is  no  inter¬ 
ference  between  transactions  of  concurrent  programs.  In  our 
parallel  database,  each  node  is  responsible  for  a  portion  of 
the  databa.se.  When  an  accumulated  tran.saction  is  performed, 
the  sets  of  facts  to  be 'inserted  into,  and  deleted  from,  the 
database  must  be  broken  down  into  subsets  that  can  be  sent 
to  the  processors  which  are  responsible  for  the  relevant 
ranges  of  data. 

Each  host  in  the  system  will  have  a  copy  of  the  Partition¬ 
ing  Map  (PM).  The  Partitioning  Map  is  a  small  semantic 
database  containing  information  about  data  distribution  in 
the  system.  Figure  I  is  a  semantic  schema  of  the  partitioning 
map. 

The  partitioning  map  contains  a  set  of  ranges  and  their 
lexicographical  bounds  -  the  low-hound  and  the  fiii^fi-hound 
values..  When  a  query  or  transaction  arrives,  the  host  will 
identify  its  lexicographical  bounds.  The  host  will  then  use 
the  partitioning  map  to  determine  a  set  of  ranges  that  needs 
to  be  retrieved  or  updated  and  hence  the  nodes  which  will  be 
involved  in  the  current  transaction  or  query. 

The  partitioning  map  will  be  replicated  among  hosts. 
However,  this  does  not  imply  that  wc  need  a  global  data 
structure;  the  algorithm  described  below  allows  updates  of 
the  partitioning  database  to  be  performed  gradually,  without 
locking  and  interrupting  all  hosts. 

A  range  can  be  obtained  from  the  node  pointed  to  by  the 
locarion  reference  in  the  partitioning  database.  This  node 
should  either  have  the  range  or  a  reference  to  another  node 
which  contains  the  range. 

To  perform  load  balancing  we  will  need  to  move  ranges 
from  one  node  to  another.  A  moved  range  will  be  accessible 
via  an  indirect  reference  that  is  left  at  its  previous  location. 
Such  an  indirect  access  slows  down  the  system,  especially 
when  the  range  is  frequently  accessed  by  users.  To  allow  a 
direct  access  to  the  mtn'ed  range  we  need  to  update  the  loca¬ 
tion  reference  in  the  partitioning  database.  We  will  not  per¬ 
form  this  update  simultaneously  for  all  the  host  interfaces. 
The  update  will  be  performed  when  a  host  executes  the  first 
query  or  transaction  that  refers  to  the  range  that  was  trans¬ 
ferred.  The-  node  that  actually  holds  the  range  will  .send  the 
results  to  the  host  along  with  a  request  to  update  the  parti¬ 
tioning  map.  This  means  that  the  first  transaction  will  have 
to  travel  a  little  further  than  all  subsequent  transactions.  The 
second  and  future  queries  or  transactions  made  through  this 
host  will  be  executed  directly  by  the  node  pointed  to  by  the 
location  reference. 

The  data  structure  at  each  node  which  supports  indirect 
referencing  will  be  exactly  the  same  as  the  partitioning  map 
described  above.  We  will  call  this  data  structure  a  local  parti¬ 
tioning  map. 

Each  range  of  facts  will  be  represented  as  a  separate  B- 
tree  structure  which  will  reside  on  the  node  pointed  to  by  the 
partitioning  map.  Consider  a  case  where  a  range  has  been 
moved  several  times  from  one  node  to  another.  We  may  have 
multiple  indirection  references  to  the  actual  location  of  the 
range.  These  indirect  references  will  be  changed  to  direct 
references  a^dcscribc?!  above. 
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Figure  1  P^irtiiioning  map 

5.  DATA  TRANSFER  POLICY 

In  order  to  ensure  lhat  ihc  database  remains  consistent 
ihroLighoui  a  load  balancing  data  transfer,  load  balancing 
actions  are  executed  as  transactions  initiated  by  the  system. 
A  large  range  of  lacts  is  transferred  by  executing  a  scries  of 
small  system  transactions  that  transfer  small  portions  of  data 
from  one  partition  to  another.  The  system  transactions  arc 
subieci  to  the  same  logging  and  recovery  actions  as  regular, 
user  initiated,  transactions.  Apart  from  the  data  transfer,  each 
small  load  balancing  transaction  also  includes  the  data  ncc- 
essarx  lor  uixiaiing  the  partiti^ming  map.  To  ensure  that  the 
partitioning  map  remains  ctmsistcnt.  the  partitioning  map 
update  is  executed  using  a  2-phase  commit  prtncK'ol. 

6.  LOAD  BALANCING  POLICY 

When  idle,  the  host  interfaces  will  send  data  and  work  load 
statistics  recently  accumulated  from  the  nodes  to  a  Global 
Performance  Anaiy/er  (GPA).  The  host  interfaces  accumu¬ 
late  this  data  as  the  results  of  queries  and  transactions  flow 
through  them  back  to  the  user.  The  GPA  is  a  process  lhat 
analxv.es  the  statistical  information  obtained  and  generates 
preferable  directions  of  data  transfer  for  each  node. 

The  statistics  report  xvill  contain  only  the  changes  since 
the  previous  report: 

•  Changes  in  data  partitioning 

•  Number  of  accesses  for  each  range 

•  Free  space  on  each  node 

The  GPA  xvill  u.sc  a  heuristic  search  algorithm  which  uses  a 
choice  function  to  select  a  small  number  of  possible  data 
movements  for  the  system.  The  final  stale  will  be  estimated 
by  a  static  evaluation  function  S.  The  GPA  will  select  the 
data  movement  with  the  lowest  value  of  the  resulting  static 
evaluation  S. 

The  choice  function  should  comply  with  the  following 
strategies: 


1.  Whenever  possible  load  balancing  should  be 
achieved  by  joining  ranges  together.  Joining  ranges 
will  result  in  faster  query  execution  and  smaller  par¬ 
titioning  maps. 

2.  A  criterion  for  detennining  preferable  destinations 
for  a  range  transfer  is  the  desire  to  move  a  range  to 
a  destination  node  which  contains  the  lexicographi¬ 
cally  closest  range  to  the  transferred  range.  In  other 
words,  it  is  desirable  to  locate  lexicographically 
close  ranges  on  the  same  node  whenever  possible. 

3.  If  a  range  has  an  exceptionally  high  number  of 
access  or  requires  an  exceptionally  large  amount  of 
storage  -  split  the  range  into  several  parts  and  trans¬ 
fer  them  to  other  nodes. 

Each  node  xvill  be  characterized  by  two  parameters: 

1 .  The  amount  of  free  disk  space  on  the  node,  h 

2.  The  percentage  of  idle  time  /.  In  other  xvords  the  / 
is:  I  =  Idle/T.  where  T  is  a  given  time  interval  and 
Idle  is  the  node's  idle  lime  during  the  lime  7. 

The  resulting  .state  will  be  estimated  by  the  following  param¬ 
eters: 

j  4  _  the  total  amount  of  data  that  will  be  necessary  to 

transfer  in  the  system 

2.  /)/-  -  the  mean  square  deviaiicm  of  F 

3.  Df  ~  the  mean  square  deviation  of  / 

4.  M  -  total  number  of  ranges  in  the  system 

The  static  cxaluaiion  function  can  he  represented  as: 

S  =  C,  D;  ^  C,  A/, 

xvhcrc  C|.  Q.  Cy  and  Cj  arc  constants. 

7.  CONCLUSION 

Our  load  balancing  algorithm  will  provide  our  massively 
parallel  semantic  database  machine  with  a  method  to  reparti¬ 
tion  data  to  evenly  distribute  work  among  its  processors.  The 
algorithm  has  very  little  overhead,  as  its  statistics  are  accu¬ 
mulated  during  the  normal  processing  of  transactions.  The 
load  balancing  is  accomplished  by  repartilioning  parts  of  the 
database  over  the  nodes  of  the  database  machine.  The  repar-  - 
litioning  will  be  transparent  to  the  users  and  will  not 
adversely  affect  the  performance  of  the  system.  Our  fault-  : 
tolerant  data  transfer  policy  will  ensure  that  the  databa.se  and  ? 
its  partitioning  maps  remain  consistent  during  reparlitioning.  : 

We  are  currently  developing  a  prototype  parallel  semantic 
databa.se  on  a  network  of  workstations.  We  will  evaluate  our  i 
load  balancing  algorithm  on  this  prototype  system  and  1 
experiment  with  ways  to  optimize  our  heuristic  search  algo-  I 
rithm.  } 
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SB2  Benchmark  (Consumer  Survey  Database) 

The  purpose  of  the  test  database  is  to  store  the  information  gathered  in  a  typical  consumer 
survey.  All  consumers  are  grouped  by  the  type  of  the  product  they  use.  These  groups  are 
represented  in  the  schema  by  subcategories  G0,G1,...G9.  A  consumer  usually  belongs  to 
several  groups.  Within  each  group  a  consumer  may  use  several  brands  of  the  same 
product.  The  integer  attributes  A0,A1,...A9  are  used  to  indicate  which  brands  the 
consumer  uses  in  the  order  of  preference.  Sometimes  a  consumer  may  want  to  enter  a 
comment  about  any  brand  he  uses.  Comments  for  the  corresponding  brands  are  stored  in 
the  attributes  C0,C1,...C9.  Comments  are  entered  very  rarely. 

The  problem  domain  can  be  easily  represented  in  a  semantic  schema.  A  relational  schema 
for  Oracle  allows  several  different  designs.  The  design  choice  impacts  both  the  space 
requirements  for  the  database,  and  the  efficiency  of  the  transactions  performed.  Designing 
a  relational  schema  we  must  take  into  consideration  the  characteristics  of  the  data  that  we 
intend  to  store  (which  is  not  required  for  the  semantic  schema  design).  Since  the  tables 
G1,G2,...G9  are  going  to  be  sparse,  we  have  two  reasonable  choices  for  the  schema 
design,  and  we  have  implemented  both. 

In  the  first  design,  which  we  call  "Sparse,"  groups  are  represented  as  different  tables  like 
in  the  semantic  schema.  In  the  second  design,  which  we  call  "Compact,"  all  the  data  for 
all  the  groups  is  stored  in  one  table  having  a  three-attribute  key  (consumer,  group  and 
brand).  This  Compact  approach  is  intended  to  save  space,  as  it  contains  rows  only  for 
those  brands  that  are  actually  used  by  the  corresponding  consumer. 

The  database  must  be  able  to  efficiently  respond  to  arbitrary  queries.  Therefore,  the 
relational  database  was  created  fully  indexed.  Further,  Oracle  was  allowed  to  gather 
statistics  on  the  database  prior  to  running  the  benchmark  transactions. 

In  the  Compact  relational  design  the  database  itself  required  less  space  than  in  Sparse 
relational  design,  but  the  total  occupied  space  including  indexes  was  comparable  for  both 
designs,  and  about  3  times  the  space  required  by  Semantic  Database.  The  benchmark  tests 
were  performed  for  two  different  database  sizes.  In  the  first  test,  the  initial  database 
contained  100,000  consumers,  in  the  second  -  500,000  consumers.  The  actual  database 
sizes  are  shown  in  the  results  section. 

Oracle  transactions  were  written  using  the  Embedded  SQL.  Transactions  for  the  Semantic 
Database  were  written  using  semantic  API.  The  same  row  data  files  were  loaded  into  the 
Oracle  and  the  Semantic  Database. 


The  initial  data  for  the  SB2  Benchmark  Database  (Consumer  Survey  Database) 


The  consumer’s  name,  address  and  comment  are  strings  of  random  alpha  characters.  The 
length  of  the  string  is  generated  as  a  random  number  with  a  Normal  Distribution  (a  table 
of  distribution  parameters  is  given). 

Each  consumer  can  belong  to  a  number  of  groups  and  within  that  group  he  can  use  a 
number  of  brands  of  the  product.  He  can  have  a  number  of  hobbies  and  use  a  number  of 
stores.  All  these  numbers  are  random  with  a  Normal  Distribution  and  parameters 
according  to  the  table. 

Comment  fields  are  filled  for  2%  of  corresponding  numeric  brand  preference  values. 

SSN  is  a  random  number  uniformly  distributed  in  the  range  100,000,000  ..  999,999,999. 

For  the  purpose  of  establishing  of  a  “knows”  relation  all  consumers  are  divided  into 
disjoint  sets  S,..Sn,  where  each  set  contains  exactly  10  consumers.  Then  5  random 
consumers  from  each  set  Sj  are  related  to  5  random  consumers  from  the  set  Si+|. 

There  are  20  different  stores  with  names  “Store  name  #1”  through  “Store  name  #20”  and 
types  “Store  type  #1”  through  “Store  type  #3”. 

Expenditure  is  a  random  number  with  a  Normal  Distribution  and  parameters  according  to 
the  table. 

Cid  is  the  ID  of  a  consumer,  assigned  sequentially. 


Table  of  Normal  Distribution  parameters: 


Lower 

Upper 

Mean 

Variance 

bound 

bound 

Name  length 

5 

40 

12 

5 

Address  length 

15 

100 

35 

20 

Comment  length 

5 

255 

30 

100 

Number  of  hobbies  per  consumer 

0 

19 

0 

10 

Number  of  stores  per  consumer 

1 

19 

4 

10 

Expenditure 

1 

89 

20 

10 

Number  of  groups  a  consumer  belongs  to 

1 

10 

5 

4 

Number  of  brands  a  consumer  uses 

0 

9 

1 

1 

SB2  Benchmark  (Consumer  Survey  Database) 
Transactions 


Transaction  1: 

How  many  consumers  are  in  the  intersection  of  the  ten  groups  G0..G9. 


Transaction  2: 

Create  a  new  group  GIO  and  populate  it  with  those  consumers  who  belong  to  both 
G1  and  G2  and  have  Al=l  in  G1  and  A2=2  in  G2. 


Transaction  3: 

How  many  consumers  are  customers  of  store  X  and  have  hobby  Y,  excluding 
those  who  belong  to  both  G3  and  G4  and  have  A3=3  in  G3  and  A4=4  in  G4. 


Transaction  4: 

For  each  person  from  a  given  (randomly  chosen)  set  of  0.1%  of  all  consumers, 
expand  the  relation  “knows”  to  relate  this  person  to  all  people  he  has  a  chain  of 
acquaintance  to.  Abort  the  transaction  rather  than  commit.  Print  the  length  of  the 
maximal  chain  from  the  person. 


Transaction  5; 

Calculate  the  number  of  consumers  in  each  group. 


SB2  Benchmark  (Consumer  Survey  Database) :  SDB  vs  Oracle 


SB2 


Database 

Semantic 

Oracle  Compact 

Oracle  Sparse  | 

#  of  consumers 

100K 

500K 

100K 

500K 

100K 

500K 

Source  size  (Mb) 

23.00 

123.00 

23.00 

123.00 

23.00 

123.00 

Database  size  (Mb) 

51.00 

207.00 

141.00 

648.00 

123.00 

580.00 

T 1  cold  (sec) 

5.76 

29.41 

42.10 

764.10 

32.11 

181.80 

T 1  hot  (sec) 

4.95 

25.62 

28.82 

727.95 

4.21 

21.65 

T2  cold  (sec) 

0.40 

1.55 

25.34 

153.40 

16.36 

36.61 

T2  hot  (sec) 

0.11 

0.29 

2.75 

13.88 

0.84 

1.20 

T3  cold  (sec) 

1.33 

5.84 

90.48 

390.70 

65.51 

276.20 

T3  hot  (sec) 

1.16 

5.55 

48.34 

389.63 

45.28 

223.28 

T4  cold  (sec) 

3.14 

11.58 

8.50 

23.64 

5.70 

28.65 

T4  hot  (sec) 

0.08 

0.28 

1.18 

10.22 

1.11 

7.43 

T5  cold  (sec) 

9.28 

45.87 

67.84 

187.60 

24.04 

29.90 

T5  hot  (sec) 

8.70 

43.86 

36.86 

187.83 

5.62 

28.05 

T1  cold 


500K 


100K 


142.10 
f  32.11 
^.76 

TJTTBD 

164.10 


T2  cold 


6^ 


500K 


i 

■ _ I  25.34 

i  100K  Mie.ae; 
cr4o 


ill  153.40 


0.00  200.00  400.00  600.00  800.00!  HOra-compact  :  :  0.00  50.00  100.00  150.00  200.00!  HOra-compact 


Seconds 


B  Ora-sparse 
■  Semantic 


Seconds 


j  EOra-sparse 
1  ■Semantic 


T3  cold 


T4  cold 


P90.70 


500K 

276.^0 

500K 

100K 

100K 

0.00  100.00  200.00  300.00  400.00! 

Seconds 


11  Ora-compact  i : 
I  ■  Ora-sparse 
i  ■  Semantic 


0.00 


T5  cold 


0.00  50.00  100.00  150.00  200.00 

Seconds 


CEI  Ora-compact 

■  Ora-sparse 

■  Semantic 


8.65 


10.00 


Seconds 


20.00  30  ooi  OB  Ora-compact  I  i 

{■Ora-sparse  ij 
I  i  : 

t  ■  Semantic  1  - 


Database  size 


500K 


141.00 

100K 

>3.00 


0.00  200.00  400.00  600.00  800.00 

Mb 


il  Ora-compact 

■  Ora-sparse 

■  Semantic 
□  Source 
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PERSON 

CORPORATION 

Name, Address:  String 

SSN:  Integer 

Name:  String  total 

Hobby:  String  m:m 
— >Knows  m:m: 

Address:  String  total 

GO 

/ 

G1 

G9 

a0,al,a2,a3,a4:  Integer 
a5,a6,a7,a8,a9:  Integer 
cOyCl ,c2,c3,c4,c5:  String 
c6,c7,c8,c9:  String 

a0,al,a2,a3,a4:  Integer 
a5,a6,a7,a8,a9:  Integer 
cO,c I, c2, c3, c4, c5:  String 
c6,c7,c8,c9:  String 

oOmI ,a2,a3,a4:  Integer 
a5,a6,a7,a8,a9:  Integer 
c0,cl ,c2,c3,c4,c5:  String 
c6,c7,c8,c9:  String 

Figure  1.  Semantic  Schema  for  SB2  Benchmark 
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PERSON 

KNOWS 

CORPORATION 

PID‘key:  Integer  1:1 
Name,  Address:  String 
SSN:  Integer 

PersonID-k:  Integer 
KnowsID-k:  Integer 

CorporationID-key:  Integer  I: I 
Name:  String  total 

Address:  String  total 

HAS  HOBBY 

HAS  STORE 

PersonID-k:  Integer 

ConsumerlD-k:  Integer 

HobbyID-k:  Integer 

StorelD-k:  Integer 

HOBBY 

CONSUMER 

STORE 

HobbyID-key:  Integer  I: I 
Name:  String 

CID-key:  Integer  1:1 
expenditure:  Integer 
Type, ID:  Integer 

StorelD-key:  Integer  1:1 
Name:  String  total 

Type:  String 

GROUPO 

GROUP9 

ConsumerlD-key:  Integer  1:1 

ConsumerlD-key:  Integer  1:1 

a0,al,a2,a3,a4:  Integer 

aO,aTa2,a3,a4:  Integer 

a5,a6,a7,a8,a9:  Integer 

a5, a6,a7, a8, a 9:  Integer 

c0,cl,c2,c3,c4,c5:  String 

c0,cl, c2, c3, c4, c5:  String 

c6,c7,  c8, c9:  String 

c6,c7,c8,c9:  String 

Figure  1.  Relational  Schema  for  SB2  Benchmark  (Sparse  alternative) 


GROUP 

ConsumerlD-k:  Integer 
Group-k:  Integer 
Brand,  Value:  Integer 
Comment:  String 


Figure  2.  Relational  Schema  for  SB2 
Benchmark  (Compact  alternative) 


Oracle  is  fully  indexed.  Best  results  are  chosen  for  Oracle  with/without  statistics. 
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We  have  adapted  SQL,  the  standard  relational  database  language,  to  semantic  databases.  The 
original  purpose  of  this  adaptation  was  to  be  compatible  with,  and  be  able  to  communicate 
with,  relational  tools.  Interestingly,  it  turned  out  that  the  size  of  a  typical  SQL  program  for  a 
semantic  database  is  many  times  smaller  than  for  an  equivalent  relational  database.  While  we 
have  previously  demonstrated  substantial  program-size  advantage  for  other  languages,  we  had 
not  anticipated  an  even  greater  advantage  with  SQL  —  a  specialized  language  for  relational 
databases. 

Our  ODBC  driver  for  the  SDB  Engine  is  fully  operational,  allowing  SQL  querying  of  a 
semantic  database  and  interoperability  with  relational  database  tools,  e.g.  end-user  systems 
like  MS  Access  Query-By-Example.  In  these  tools  the  number  of  user  keystrokes  required  is 
proportional  to  the  size  of  the  generated  SQL  program.  So  again,  savings  are  realized  and 
simplicity  is  attained  by  use  of  the  SDB  model. 

An  embedded  SQL  preprocessor  has  been  developed  and  is  fully  operational. 

Our  application  of  SQL  to  semantic  databases  allows  utilization  of  full  semantics  of  data, 
applies  to  scientific  and  spatial  data,  properly  treats  missing  values,  and  produces  queries 
which  are  typically  an  order  of  magnitude  shorter  than  if  written  in  SQL  for  an  equivalent 
normalized  relational  database  —  see  examples  in  Section  4. 

1.  SQL  INTERPRETATION 

We  use  the  same  syntax  as  the  standard  ODBC  SQL  (with  null  values).  However,  our  SQL 
queries  refer  to  a  virtual  schema.  This  virtual  schema  consists  of  an  inferred  table  T  defined 
for  each  category  C  as  a  spanning  tree  of  all  the  relations  reachable  from  C.  This  is 
recursively  defined  as  follows: 

Let  C  be  a  category. 

( 1 )  The  first  attribute  of  T; 

□  C  —  attribute  of  T,  range:  C  {m:l) 


{')  Vi  y?) 
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(2)  For  every  attribute  A  of  T,  for  every  relation  r  whose  domain  intersects  with  the  range  of 
A: 

□  —  attribute  of  T,  range:  mnge(r)  (m:l) 

provided  the  depth  of  recursion  does  not  exceed  the  system  variable  $MAXDEPTH 

If  the  original  relation  r  is  many-to-many  or  one-to-many,  the  new  attribute  would  be  many- 
to-one,  but  many  virtual  rows  would  exist  in  the  table  T,  one  for  each  instance  of  the  tree.  If 
r  has  no  value  for  an  object,  a  null  value  will  appear  in  the  virtual  relational  table. 

The  name  of  T  is  the  same  as  of  C,  . , 

The  attribute  names  of  T  contain  long  prefixes.:  These  prefixes  can  be  omitted  when  no 
ambiguity  arises,  i.e..  attribute  y  is  a  synonym  of  the  attribute  x_y  of  T  if  T  has  no  other 
attribute  z_y  where  depth(z)>=depth(x). 

We  note  that  the  range  of  a  virtual  attribute  may  be  of  multi-media  type:  numbers  with 
unlimited  varying  precision  and  magnitude,  texts  of  unlimited  size,  images,  etc. 

Prior  to  computing  the  virtual  tables,  we  eliminate  all  special  characters,  including 
underscores,  from  concept  names;  we  augment  the  schema  or  the  user-view  with  the 
following  virtual  relations: 

•  inverted  relations:  for  every  relation  R,  its  inverse  is  called,  by  default,  _R 

•  for  every  category  C,  a  surrogate  attribute,  also  called  C.  This  is  the  identity  attribute  on 
C.  It  can  be  used  for  checking  on  belonging  to  a  subcategory  (let  p  be  a  PERSON.;  p  is 
a  student  iff  p.STUDENT  is  not  null)  or  to  produce  a  printable  id  of  an  object  (see 
Appendix). 

•  for  every  category,  a  combined  attribute  C_,  which  is  the  concatenation  of  all  attributes 
of  C  that  are  representable  by  printable  strings  (this  includes  numbers,  enumerated. 
Boolean.  The  concatenated  values  are  separated  by  slashes.  Null  values  are  replaced  by 
empty  strings. 

•  infinite  virtual  relations  representing  functions  over  space-time,  which  in  the  actual 
database  are  represented  by  a  finite  data  structure. 


2.  TECHNICAL  NOTES 

2.1.  Definition  of  the  Extension  of  a  Table 

The  virtual  table  T  for  a  category  C  is  logically  generated  as  follows: 

(1)  Initially,  T[C]=C,  i.e.  T  contains  one  column  called  C  whose  values  are  the  objects  of  the 
category. 

(2)  For  every  attribute  A  of  T,  for  every  schema  relation  or  attribute  r  whose  domain  may  intersect 

range(A),  let  R  be  the  relation  r  with  its  domain  renamed  A  and  range  renamed  A _ r,  let  T  be  the 

natural  right-outer-join  of  T  with  R.  (Unlike  a  regular  join,  the  outer  join  creates  A_r=null  when 
there  is  no  match.) 
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2.2.  Surrogate  Attributes 

In  Release  0.1,  the  surrogate  attribute  of  each  semantic  category  is  the  internal  id  of  the  object. 

In  Release  0.2,  the  surrogate  attribute  will  be  defined  in  accordance  with  our  document  on  surrogates 
(see  Appendix). 

2.3.  User-control  of  Table  Depth 

(Used  only  by  sophisticated  users  trying  to  outsmart  $MAXDEPTH  defined  by  a  graphical  user 
interface;  not  needed  by  users  posing  direct  SQL  queries  without  a  GUI.) 

For  each  category  C,  in  addition  to  the  default  table  named  C,  of  depth  limited  by  $MAXDEPTH, 
there  are  also  tables  called  C_i  for  any  positive  integer  i,  with  the  depth  limited  by  i  rather  than 
SMAXDEPTH.  The  tables  C_i  are  not  returned  by  the  ODBC  command  requesting  the  list  of  all 
tables. 

2.4.  User-specified  Tables 

(Used  only  by  generic  graphical  user  interfaces;  not  needed  by  users  posing  direct  ODBC  SQL 
queries) 

Let  C  be  a  category.  Let  S={  A  j  ,...,  A^  }  be  some  unabbreviated  attributes  of  the  table  C  of  type 
Abstract-object  (i.e.  no  attribute  ends  with  an  actual  concrete  attribute  of  an  original  semantic 
category).  (Recall  that  the  name  of  C  is  a  prefix  of  each  Aj ). 

We  define  a  virtual  table  T(S)  as  the  projection  of  the  table  C  on  the  of  attributes  SPP  comprised  of  the 
attributes  S,  their  prefixes,  and  one-step  extensions  of  the  prefixes. 

(An  attribute  A  is  a  prefix  of  an  attribute  in  S  iff  A  is  in  S  or  A_w  is  in  S  for  some  string  w.  An 
attribute  B  is  a  one-step  extension  of  an  attribute  A  iff  B=A  or  B=A_w  where  w  contains  no 
underscores.) 

The  name  of  T  is  generated  as  follows:  for  each  A,-  let  B,  be  the  shortest  synonym  of  A,- .  The  name  of 
Tis:Bl _ ^B2 _ ....Bk 


2.5.  Semantics  of  Updates 

Release  0.2  supports  only  restricted  updates: 

delete  from  C  where  condition  Removes  objects  from  the  root  category  C  (does  not  delete  them  from 
supercategories  of  C). 

insert  into  C  attributes  values  assignments 

Creates  a  new  object,  places  it  in  the  root  category  C,  and  relates  it  to  some  one-step  attributes  (i.e.  the 
original  attributes/relations  of  category  C  and  their  inverses.) 

insert  into  C  attributes  query 

Evaluates  the  query,  resulting  in  a  set  of  rows.  For  each  row,  a  new  object  is  created  and  placed  in  C. 
It’s  one-step  relationships  are  assigned  values  from  the  rows. 

update  C  set  A  i=e  i,...yAk  =6k  where  condition 

Selects  a  set  of  object  of  category  C.  For  each  of  them  updates  some  one-step  attributes.  For  example, 
to  make  a  person  become  a  student:  update  PERSON  set  STUDENT=PERSON  where  condition . 
To  move  the  person  from  subcategory  of  students  to  subcategory  of  instructors:  update  PERSON  set 
STUDENT=null,INSTRUCTOR=PERSON 
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insert  into  C _ ... 

Allows  creation  of  multiple  relationships  R.  This  cannot  be  accomplished  with  previous  commands 
when  R  is  many-to-many. 

delete  from  C _ ^R  where  condition 

Allows  deletion  of  multiple  relationships  R. 


3.  EXAMPLES  OF  SEMANTIC  SQL  AND  COMPARISON  TO  RELATIONAL  SQL 

This  section  contains;  the  semantic  schema  of  a  Hydrology  application;  a  normalized 
relational  schema  of  the  same  application  (a  real  schema,  not  our  virtual  schema);  several 
SQL  statements  written  for  the  semantic  schema  and  (for  comparison)  for  the  relational 
schema. 

3.1.  Hydrology  Application,  Semantic  Schema 
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Figure  3-1.  Semantic  sub-schema  for  physical  observations. 
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3.2.  Relational  Schema  of  the  Hydrology  Application 


LOCATION 

east-  :Number;  e/evarion-j^:Number; 


north-UTM-in-key^\xrc^&x\ 
description :  String; 
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PHYSICAL-OBSERVATION-STATION-BELONGS-TO-ORGANIZATION 

physical-observation-station-id-in-keylniQ^Qf,  organization— name-in-key:Stxmg', _ 


ORGANIZATION-RUNS-PROJECT 

organization— name-in-key:Stnn^;  project— name-in-key:Str'm^; 


PHYSICAL-OBSERVATION-STATION-SERVES-PROJECT 

physical-observation-station-id-in-key:lnte^er;  project— name-in-key:Stnn^] 


ORGANIZATION-IS-PART-OF-ORGANIZATION 

organization— name- in-feey:  String ;  or^anization-2  -  -name-in-key:  String; 


Figure  3-3.  Relational  sub-schema  for  physical  observations.  Part  II:  tables 
representing  the  m:m  relationships. 
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3.3.  Program  Size  Comparisons:  SQL 


1.  List  of  the  time  and  housing  of  temperature  measurements  over  50  degrees 

SQL  statement  based  on  semantic  schema: 


select  housing, time  from  MEASUREMENT  where  of _ ^name=’Temperature’  and  value>50 

SQL  statement  based  on  relational  schema: 

select  housing,  time 

from  PHYSICAL_OBSERVATION_STATION,  MEASUREMENT 

where  exists 

(select  *  from  MEASUREMENT-TYPE 

where  name_key  =  of _ ^name  and  name_key  =  ’Temperature’  and 

by_physical_observation_station_id  =  physical_observation_station_id_key  and 
value  >  50) 

(9  Vi  97) 


Confidential.  HU/ltPDRC  Internal.  ©  HPDRC  and  N.Rwhe, 


9 


2.  The  descriptions  of  organizations  and  locations  of  their  fixed  stations 

SQL  statement  based  on  semantic  schema,  Alternative  1 : 

select  description,  belongs  to  located_at _ ^LOCATION  from  ORGANIZATION 

SQL  statement  based  on  semantic  schema.  Alternative  2: 
select  description,  LOCATION  from  ORGANIZATION 
SQL  statement  based  on  relational  schema: 

select  description,  LOCATION.north_UTM_in_key,  LOCATION.east_UTM_in_key 
from  ORGANIZATION,  LOCATION 
where  exists 

(select  *  from  FIXED.STATION 
where  exists 
(select  * 
from 

PHYSICAL_OBSERVATION_STATION_BELONGS_TO_ORGANIZATION 

where  name_key  =  organization _ ^name_in_key  and 

PHYSICAL_OBSERVATION_STATION_BELONGS_TO_ORGANIZATION. 
physical_observation_station_id_in_key  = 
FIXED_STATION.physical_observation_station_id_key  and 

located  at  north  UTM  =  north_UTM_in_key  and  located_at _ east_UTM  = 

east_UTM_in_key )) 
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3.  The  observations  since  January  1,  1993  (including  images,  measurements  and  their 
types)  with  location  of  the  stations 

SQL  statement  based  on  semantic  schema: 

select  OBSERVATION_,  of_,  LOCATION  from  OBSERVATION  where  time>’  1993/01  ’ 

SQL  statement  based  on  relational  schema: 


(select  MEASUREMENT_TYPE.*,  LOCATION.north_UTM_in_key, 

LOCATION.east_UTM_in_key,  MEASUREMENT.*,  NULL,  NULL,  NULL,  NULL, 
NULL,  NULL,  NULL,  NULL,  NULL 

from  MEASUREMENT.TYPE,  LOCATION,  MEASUREMENT 

where  time  >  ’1993/01’  and  exists  (select  *  from  FIXED_STATION  where 

by _ ^physical_observation_station_id  =  physical_observation_station_id_key  and 

located  at  north  UTM  =  north_UTM_in_key  and  located_at _ ^east_UTM  = 

east_UTM_in_key  and  of _ name  =  name_key  ))  union 

(select  MEASUREMENT_TYPE.*,  NULL,  NULL,  MEASUREMENT.*,  NULL, NULL, 
NULL,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL 

from  MEASUREMENT.TYPE,  MEASUREMENT 

where  time  >  ’1993/01’  and  not  exists  (select  *  from  FIXED_STATION  where 

by _ physical_observation_station_id  =  physical_observation_station_id_key  and 

of _ ^name  =  name_key  ))  union 

(select  NULL,  NULL,  NULL,  NULL,  LOCATION.north_UTM_m_key, 

LOCATION.east_UTM_in_key,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL, 
IMAGE.* 

from  LOCATION,  IMAGE 

where  time  >  ’1993/01’  and  exists  (select  *  from  FIXED_STATION  where 

by _ physical_observation_station_id  =  physical_observation_station_id_key  and 

located  at  north  UTM  =  north_UTM_in_key  and  located_at _ ^east_UTM  = 

east_UTM_in_key ))  union 

(select  NULL,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL,  NULL, 
NULL,  IMAGE.* 

from  IMAGE 

where  time  >  ’1993/01  ’  and  not  exists  (select  *  from  FIXED-STATION  where 
by _ physical_observation_station_id  =  physical_observation_station_id_key)) 
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3.4.  Handling  of  Interpolated  Spatial  Functions 

The  following  userview  contains  infinite  virtual  categories  POINT  and  BLOCK 
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Query  1:  temperature  yalue(s)  of  a  given  point 

SELECT  value  FROM  POINT  WHERE  name="Temperature"  and  <time,x,y>=... 

Query2:  areas  that  had  temperatures  between  1  and  1.1  degrees 
SELECT  time  1  ,time2,jc  i  ^  2>)'  i  2 

WHERE  name="Temperature"  and  value>=  1  and  value<=l.l 

4.  APPENDIX;  SURROGATES 


Object  surrogates 

For  some  categories  in  the  schema,  our  main  userview  contains  surrogates,  which  are  strings 
that  identify  objects  of  that  category.  These  surrogates  are  computed  virtual  attributes;  they 
are  used  in  our  database  languages,  e.g.  the  Semantic  SQL. 

The  database  schema  defines  semantic  keys  for  some  categories.  A  semantic  key  of  a 
category  C  is  a  set  of  relations  and  attributes  of  C  that  when  they  all  exist  (non-null)  jointly 
identify  the  objects  of  a  category.  To  compute  a  surrogate  from  a  semantic  key  we 
concatenate  the  values  of  the  key  attributes  and  relations  of  an  object,  replacing  any  abstract 
objects  in  the  key  relations  by  their  surrogates  if  the  latter  exists  (otherwise  the  whole 
surrogate  is  null). 

For  each  category  C  having  surrogates,  the  latter  are  represented  in  a  virtual  attribute: 

□  C-id  —  attribute  of  C,  range:  String  (1:1) 

It  is  formally  defined  as  follows. 

Let  k(C)  be  the  semantic  key  of  C  if  one  is  defined  in  the  semantic  schema.  (If  k(C)=(R  i,R2, 
...  Rn)  it  means  that  there  is  an  integrity  constraint: 

foreveryci,C2in  C: 

for  every  jc^,...,  xn  in  OBJECT: 

if  c  1  i?  j  x  1  and  ...  and  c  j  Rn  xn  and  C2  /? i  x  j  and  ...  and  cn  Rn  xn 
then  Cj  =  C2 

(The  relations  /?,•  do  not  have  to  be  total,  unlike  keys  of  relational  databases;  nor  do  they  have 
to  be  attributes;  they  have  to  be  m:l  or  1:1.) 

We  define  auxiliary  concatenation  operator  x|y:  If  x  or  y  is  null  then  the  result  is  null. 
Otherwise,  the  result  is  concatenation  of  x  and  y  separated  by  the  character  7’.  E.g. 
’abc’|’cde’=’abc/cde’ . 

The  surrogate  of  an  object  x,  s(x),  is  defined  as  follows  (null  if  any  part  is  undefined): 

If  x  is  a  string  then  s(x)=s. 
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If  X  is  a  concrete  object  other  than  string  (number,  Boolean,  date)  then  s(x)  is  a  conversion  of 
X  into  a  string. 

If  X  is  and  abstract  object  which  belongs  to  only  one  category,  C,  for  which  a  semantic  key  is 
defined  in  the  schema,  then: 

Let  (i?  j, ... ,  Rn)  be  the  alphabetical  ordering  of  the  semantic  key  of  C . 
s(x)=  s(x.Rl)  I  s(x.R2) ...  s(x.Rn) 
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